ABSTRACT

Advanced and effective collaborative filtering methods based on explicit feedback assume that unknown ratings do not follow the same model as the observed ones (not missing at random). In this work, we build on this assumption and introduce a novel dynamic matrix factorization framework that allows setting an explicit prior on unknown values. When new ratings, users, or items enter the system, we can update the factorization in time independent of the size of the data (number of users, items, and ratings). Hence, we can quickly recommend items even to very recent users. We test our methods on three large datasets, including two very sparse ones, in static and dynamic conditions. In each case, we outperform state-of-the-art matrix factorization methods that do not use a prior on unknown ratings.

1. INTRODUCTION 

Personalizing the user experience is a continuously growing challenge for various digital applications. This is of particular importance when recommending releases on the Netflix platform, when digesting the latest Yahoo news, or when helping users find their next musical obsession.

Among the different approaches to personalization, matrix factorization ranks among the most popular [121130]. In this line of work, data is represented as a user-item matrix that encodes user-item interactions as binary or real values. Matrix factorization aims at decomposing this matrix into latent representations designed to accurately reconstruct the observed interaction values. Most interestingly, these latent features are also used to predict missing (or unknown) ratings (i.e., what rating user i would give if exposed to item j). However, by trying to predict the unknown ratings with a model trained on the observed ratings, recommender systems implicitly assume that the distribution of the observed ratings is representative of the distribution of the unknown ones. This is called the missing at random assumption, and it is probably wrong in most real-world applications. In the case of a movie recommender system, for example, users rate movies that they have seen, and their choices are biased by their interests.

In this work, building on the not missing at random assumption, we hypothesize that an unknown item is more likely to be weakly rated, owing to the huge number of existing items and the limited number of items a user may be interested in. This translates into a strong prior suggesting that unknown ratings should be reconstructed from latent features as small values (i.e., close to 0). While this assumption may be wrong in specific cases, such a constraint acts as a good regularizer that significantly improves the recommendations.

Our work is not the first to propose a new interpretation of the missing data in a matrix factorization framework [28]. However, to the best of our knowledge, we are the first to propose an online learning mechanism that sets an explicit prior on unknown values without any significant additional cost. We introduce a method to update our model each time a new rating is observed, with a time complexity independent of the size of the data (i.e., the total number of users, items, and ratings). This fast update mechanism keeps the model up to date when a flow of new users, items, and ratings enters the system.

The contributions of this work are as follows: 

• We extend the squared loss, the absolute loss, and the generalized Kullback-Leibler divergence to take into account an explicit prior on unknown values.

• For each loss function, we derive an efficient online learning algorithm to update the parameters of the model with a complexity independent of the data size.

• We validate the hypothesis that applying an explicit prior on missing ratings improves the recommendations in a static and in a dynamic setting on three public datasets.

• Our methods are easy to implement, and we provide an open-source implementation of the squared loss and absolute loss.

The rest of this paper is organized as follows. Section 2 summarizes the recommendation problem, and Section 3 formulates how to apply priors on unknown values in the context of recommendation. Section 4 extends three loss functions and shows how they can be optimized in a static and dynamic fashion. Section 5 presents our experimental results, and Section 6 discusses works related to our study. Section 7 concludes this paper.

2. THE RECOMMENDATION PROBLEM 

Before addressing the challenge of interpreting missing 
data, let us state the standard recommendation problem. 

We have at our disposal $m$ items rated by $n$ different users, where the rating given by the i-th user to the j-th item is denoted by $r_{ij}$. In many real applications, these ratings take an integer value between 1 and 5. In this work, we assume that ratings are positive and that an item rated by user $i$ with a high numerical value is preferred by this user over items she rated with lower numerical values. We denote by $\mathcal{R}$ the set of all known ratings, and by $\mathcal{R}_{i\bullet}$ and $\mathcal{R}_{\bullet j}$ the sets of known ratings of user $i$ and item $j$, respectively. If $r_{ij} \notin \mathcal{R}$, we say that the rating is unknown.

For a while, the objective of recommender systems was to predict the value of unknown ratings. It is now widely accepted that a more practical goal is to correctly rank the unknown ratings for each user, while the actual value of the rating is of little interest. This has led to a change in the way methods are evaluated (in terms of ranking metrics such as NDCG, AUC, or MAP, instead of rating prediction metrics such as RMSE). We embrace that shift toward ranking: the purpose of adding a prior on the unknown ratings is not to improve matrix factorization techniques in terms of RMSE, but in terms of ranking metrics.

Matrix factorization methods produce for each user and each item a vector of $k$ ($\ll n$ and $m$) real values that we call latent features. We denote by $w_i$ the row vector containing the $k$ features of the i-th user, and by $h_j$ the row vector of $k$ features associated with the j-th item. Also, we denote by $W$ the $n \times k$ matrix whose i-th row is $w_i$, and by $H$ the $k \times m$ matrix whose j-th column is $h_j^T$. Matrix factorization is presented as an optimization problem, whose general form is:


$$\arg\min_{W,H} \sum_{i,j \mid r_{ij} \in \mathcal{R}} E(r_{ij}, w_i h_j^T) + R(W, H) \qquad (1)$$


where $R$ is a regularization term (often the $L_1$ or $L_2$ norm), and $E$ measures the error that the latent model makes on the observed ratings. Most often, $E$ is the squared error.

Using a matrix factorization approach for predicting unknown ratings relies on the hypothesis that a model accurately predicting observed ratings generalizes well to unknown ratings. In the following section, we argue that this hypothesis is easily challenged.


3. INTERPRETING MISSING DATA 

LaunchCast is Yahoo’s former music service, where users could, among other things, rate songs. In a 2006 survey, users were asked to rate randomly selected songs [18]. The distribution of ratings of random songs was then compared to the distribution of voluntary ratings. The experiment concluded that the distribution of the ratings for random songs was strongly dominated by low ratings, while the voluntary ratings had a distribution close to uniform.

Intuitively, a simple process could explain the results: users choose to rate songs they listen to, and listen to music they expect to like, while avoiding genres they dislike. Therefore, most of the songs that would get a bad rating are not voluntarily rated by the users. Since people rarely listen to random songs, or rarely watch random movies, we should expect to observe in many areas a difference between the distribution of ratings for random items and the corresponding distribution for the items selected by the users. This observation has a direct impact on the presumed capacity of matrix factorization to generalize a model based on observed ratings to unknown ratings.


Building on the not missing at random assumption, we propose to incorporate in the optimization problem stated in Equation (1) a prior about the unknown ratings, in order to limit the bias caused by learning on observed ratings:


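$$\arg\min_{W,H} \sum_{i,j \mid r_{ij} \in \mathcal{R}} E(r_{ij}, w_i h_j^T) + \alpha \sum_{i,j \mid r_{ij} \notin \mathcal{R}} E(\hat{r}_0, w_i h_j^T) + R(W, H) \qquad (2)$$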


The objective function (Equation (2)) now has two parts (besides the regularization): the first part fits the model to the observed ratings, and the second part drives the model toward a prior estimate $\hat{r}_0$ on the unknown ratings. In the absence of further knowledge about a specific dataset, we suggest using $\hat{r}_0 = 0$, the worst rating, as a prior estimate. The coefficient $\alpha$ balances the influence of the unknown ratings, and the original formulation is obtained with $\alpha = 0$. We expect $\alpha$ to be small to deal with the problem of class imbalance. Indeed, in real-life applications the number of known ratings $|\mathcal{R}|$ is very small in comparison to the number of unknown ratings ($nm - |\mathcal{R}|$), and if $\alpha$ is close to 1, or larger, the second term of the objective function will completely dominate the other parts and drive all the users’ and items’ features to zero. It is therefore important to find the right balance between the influence of the few known ratings and that of the many unknown ones.

In order to have a more intuitive feeling of the influence of both parts of the objective function, we introduce $\rho = \alpha (nm - |\mathcal{R}|) / |\mathcal{R}|$, which can be interpreted as an influence ratio between unknown and known ratings. If $\rho = 0$, the unknown ratings are ignored; if $\rho = 1$, the known ratings and the unknown ratings have the same global influence on the objective function; if $\rho = 2$, the unknown ratings are twice as important as the known ratings, etc.
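To make the relation between the influence ratio and the coefficient concrete, the following minimal sketch inverts the definition above to recover the coefficient from a chosen ratio. The function name `alpha_from_rho` is hypothetical, and the dataset sizes are the Movielens figures reported in Table 1.

```python
def alpha_from_rho(rho, n_users, n_items, n_ratings):
    """Invert rho = alpha * (n*m - |R|) / |R| to recover alpha."""
    n_unknown = n_users * n_items - n_ratings
    if n_unknown <= 0:
        raise ValueError("no unknown ratings: alpha is undefined")
    return rho * n_ratings / n_unknown

# Movielens sizes from Table 1: 6,040 users, 3,706 items, 1,000,209 ratings.
alpha = alpha_from_rho(1.0, 6040, 3706, 1000209)
print(f"alpha for rho = 1: {alpha:.4f}")
```

Even on this comparatively dense dataset, an influence ratio of 1 corresponds to a coefficient below 0.05, which illustrates why the coefficient is expected to be small.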

A more involved model could assume an adaptive p per 
user or item, which could lead to additional, albeit small, 
gains. However, this implies more parameters to tune, more 
cumbersome equations to explain and an involved process 
to prove that the complexity of the method remains the 
same. Due to limited space, instead, we provide a general 
demonstration of the method and leave the adaptive model 
for future work. 


4. LOSS FUNCTIONS 

An obvious difficulty raised by the new optimization problem introduced earlier is the apparent increase in complexity. The naive complexity of evaluating this objective function is $O(nmk)$, while it is $O(|\mathcal{R}|k)$ for classical matrix factorization approaches (Equation (1)). In this section, we demonstrate how it is possible to use our new model without the naive additional cost, and present a way to perform fast updates to incorporate new ratings in the model.

To this end, we show the applicability of our method when $E$ is the squared loss in Section 4.1 and the absolute loss in Section 4.2. For the sake of demonstration, we also discuss its applicability to the generalized Kullback-Leibler divergence in Section 4.3. Finally, in Section 4.4 we outline how the method can be applied in a static setting, and in a dynamic setting with continuous updates of new ratings, items, and users.


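4.1 Squared Loss

By considering $E$ as the squared loss, and $R$ as the $L_1$ regularization, the optimization problem becomes:

$$\arg\min_{W,H} \sum_{i,j \mid r_{ij} \in \mathcal{R}} (r_{ij} - w_i h_j^T)^2 + \alpha \sum_{i,j \mid r_{ij} \notin \mathcal{R}} (w_i h_j^T)^2 + \lambda (\|W\|_1 + \|H\|_1) \qquad (3)$$

For the sake of simplicity, let us forget about the regularization term of the objective function for now (adding it to the following development is trivial), and let us call $L(W, H, \mathcal{R})$ the objective function without regularization. We want to be able to update the features of one user or of one item in a time independent of the size of the dataset ($n$, $m$, $|\mathcal{R}|$). In the remainder, we show that it is possible to compute $\partial L / \partial w_i$ and $\partial L / \partial h_j$ with a complexity linear in the number of ratings provided by user $i$ ($|\mathcal{R}_{i\bullet}|$) or given to item $j$ ($|\mathcal{R}_{\bullet j}|$), respectively. On most datasets, and for most users and items, we have $|\mathcal{R}_{i\bullet}| \ll m$ and $|\mathcal{R}_{\bullet j}| \ll n$, and therefore computing the gradient for one user or one item is fast.

First, let us separate $L$ into $n$ blocks $l_i^w$ that contain only the terms of $L$ depending on $w_i$:

$$l_i^w = \sum_{j \mid r_{ij} \in \mathcal{R}} (r_{ij} - w_i h_j^T)^2 + \alpha \sum_{j \mid r_{ij} \notin \mathcal{R}} (w_i h_j^T)^2 \qquad (4)$$

Notice that we have:

$$L = \sum_{i=1}^{n} l_i^w \quad \text{and} \quad \frac{\partial L}{\partial w_i} = \frac{\partial l_i^w}{\partial w_i}$$

If we adopt a naive computation, the second term of Equation (4) is more time-expensive because most items are not rated by the user. However, the sum on unknown ratings (i.e., $\sum_{j \mid r_{ij} \notin \mathcal{R}}$) can be formulated as the difference between the sum on all items (i.e., $\sum_{j=1}^{m}$) and the sum on rated items only (i.e., $\sum_{j \mid r_{ij} \in \mathcal{R}}$). By so doing, the sum on unknown ratings disappears from the computations:

$$\sum_{j \mid r_{ij} \notin \mathcal{R}} (w_i h_j^T)^2 = \sum_{j=1}^{m} (w_i h_j^T)^2 - \sum_{j \mid r_{ij} \in \mathcal{R}} (w_i h_j^T)^2 \qquad (5)$$

$$= w_i S^h w_i^T - \sum_{j \mid r_{ij} \in \mathcal{R}} (w_i h_j^T)^2 \qquad (6)$$

where we have posed $S^h = \sum_{j=1}^{m} h_j^T h_j$, a $k \times k$ matrix independent of $i$ (i.e., it is the same matrix for all $l_i^w$). Assuming that $S^h$ is known, we can now compute $l_i^w$ and $\partial L / \partial w_i$ with a complexity of $O(|\mathcal{R}_{i\bullet}| k + k^2)$. From Equations (4) and (6) we obtain:

$$l_i^w = \sum_{j \mid r_{ij} \in \mathcal{R}} \left[ (r_{ij} - w_i h_j^T)^2 - \alpha (w_i h_j^T)^2 \right] + \alpha\, w_i S^h w_i^T \qquad (7)$$

We can easily derive:

$$\frac{\partial L}{\partial w_i} = -2 \sum_{j \mid r_{ij} \in \mathcal{R}} \left[ r_{ij} - (1 - \alpha)\, w_i h_j^T \right] h_j + 2 \alpha\, w_i S^h \qquad (8)$$

Symmetrically, if $S^w = \sum_{i=1}^{n} w_i^T w_i$, we have:

$$l_j^h = \sum_{i \mid r_{ij} \in \mathcal{R}} \left[ (r_{ij} - w_i h_j^T)^2 - \alpha (w_i h_j^T)^2 \right] + \alpha\, h_j S^w h_j^T \qquad (9)$$

and:

$$\frac{\partial L}{\partial h_j} = -2 \sum_{i \mid r_{ij} \in \mathcal{R}} \left[ r_{ij} - (1 - \alpha)\, w_i h_j^T \right] w_i + 2 \alpha\, h_j S^w \qquad (10)$$

Assuming that $S^w$ is known, the complexity of computing $l_j^h$ or $\partial L / \partial h_j$ is now $O(|\mathcal{R}_{\bullet j}| k + k^2)$, and the complexity of computing it for every $j \in \{1, \ldots, m\}$ is $O(|\mathcal{R}| k + m k^2)$.

4.2 Absolute Loss

A similar development can be done when the squared loss is replaced by the absolute loss. With the absolute loss, $L$ becomes:

$$L = \sum_{i,j \mid r_{ij} \in \mathcal{R}} |r_{ij} - w_i h_j^T| + \alpha \sum_{i,j \mid r_{ij} \notin \mathcal{R}} |w_i h_j^T|$$

As with the squared loss, we divide $L$ into $l_i^w$ and $l_j^h$.

As with the squared loss, we will change the expression of $l_i^w$ to remove the sum over all unknown ratings, but in this case we have to impose non-negativity of the features to go further. If $W, H \geq 0$, we have $|w_i h_j^T| = w_i h_j^T$, and therefore:

$$\sum_{j \mid r_{ij} \notin \mathcal{R}} |w_i h_j^T| = \sum_{j=1}^{m} w_i h_j^T - \sum_{j \mid r_{ij} \in \mathcal{R}} w_i h_j^T \qquad (11)$$

$$= w_i \Big( \sum_{j=1}^{m} h_j \Big)^T - \sum_{j \mid r_{ij} \in \mathcal{R}} w_i h_j^T \qquad (12)$$

Here, instead of $S^w$ and $S^h$, we define $s^w = \sum_{i=1}^{n} w_i$ and $s^h = \sum_{j=1}^{m} h_j$. We can now express $l_i^w$ and $\partial L / \partial w_i$ efficiently:

$$l_i^w = \sum_{j \mid r_{ij} \in \mathcal{R}} \left( |r_{ij} - w_i h_j^T| - \alpha\, w_i h_j^T \right) + \alpha\, w_i (s^h)^T \qquad (13)$$

so that:

$$\frac{\partial L}{\partial w_i} = \sum_{j \mid r_{ij} \in \mathcal{R}} \left( \mathrm{sign}(w_i h_j^T - r_{ij}) - \alpha \right) h_j + \alpha\, s^h \qquad (14)$$

where $\mathrm{sign}(x) = x / |x|$ if $x \neq 0$, and equals 0 otherwise. Assuming $s^h$ is known, the complexity of computing $l_i^w$ or $\partial L / \partial w_i$ is now $O(|\mathcal{R}_{i\bullet}| k)$. The corresponding expressions of $l_j^h$ and $\partial L / \partial h_j$ follow by symmetry, and the complexity to compute them is $O(|\mathcal{R}_{\bullet j}| k)$.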
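As a numerical sanity check of the sparsity trick for the squared loss, the sketch below compares the gradient of Equation (8), which touches only the rated items plus one k x k product, with a naive gradient that loops over all items. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy sizes, and the random data are all hypothetical, and the prior is taken to be 0.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram(rows, k):
    """S = sum_j h_j^T h_j, a k x k matrix."""
    return [[sum(h[a] * h[b] for h in rows) for b in range(k)] for a in range(k)]

def grad_user_naive(w, H, rated, alpha):
    """Gradient of the user block, summing over all m items: O(mk)."""
    g = [0.0] * len(w)
    for j, h in enumerate(H):
        p = dot(w, h)
        # Rated items use the squared error, unrated ones the prior term.
        c = -2.0 * (rated[j] - p) if j in rated else 2.0 * alpha * p
        for a in range(len(w)):
            g[a] += c * h[a]
    return g

def grad_user_fast(w, H, rated, alpha, S_h):
    """Equation (8): rated items only, plus w S^h: O(|R_i| k + k^2)."""
    k = len(w)
    g = [2.0 * alpha * sum(w[b] * S_h[b][a] for b in range(k)) for a in range(k)]
    for j, r in rated.items():
        h = H[j]
        c = -2.0 * (r - (1.0 - alpha) * dot(w, h))
        for a in range(k):
            g[a] += c * h[a]
    return g

random.seed(0)
k, m, alpha = 3, 50, 0.05
H = [[random.random() for _ in range(k)] for _ in range(m)]
w = [random.random() for _ in range(k)]
rated = {2: 4.0, 17: 5.0, 31: 1.0}  # item index -> rating of one user
S_h = gram(H, k)

naive = grad_user_naive(w, H, rated, alpha)
fast = grad_user_fast(w, H, rated, alpha, S_h)
max_gap = max(abs(a - b) for a, b in zip(naive, fast))
print(max_gap)
```

The two gradients agree up to floating-point rounding, while the fast version never iterates over the unrated items once the Gram matrix is available.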


4.3 Generalized Kullback-Leibler Divergence 

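For the sake of demonstration on other common loss functions in matrix factorization, we show here the applicability of the sparsity trick to the generalized Kullback-Leibler divergence (GKL). We do not elaborate further on this function in the rest of the paper.

The generalized Kullback-Leibler divergence is defined as follows:

$$D(r_{ij} \,\|\, w_i h_j^T) = r_{ij} \log\left( \frac{r_{ij}}{w_i h_j^T} \right) - r_{ij} + w_i h_j^T \qquad (15)$$

The GKL is not defined when $r_{ij} = 0$. In the following we extend the GKL by using its limit value:

$$D(0 \,\|\, w_i h_j^T) := \lim_{r \to 0} D(r \,\|\, w_i h_j^T) = w_i h_j^T \qquad (16)$$

Using Equations (15) and (16), $L$ becomes:

$$L = \sum_{i,j \mid r_{ij} \in \mathcal{R}} \left( r_{ij} \log\left( \frac{r_{ij}}{w_i h_j^T} \right) - r_{ij} + w_i h_j^T \right) + \alpha \sum_{i,j \mid r_{ij} \notin \mathcal{R}} w_i h_j^T$$

We now follow the same development as with the other losses. We define $l_i^w$:

$$l_i^w = \sum_{j \mid r_{ij} \in \mathcal{R}} \left( r_{ij} \log\left( \frac{r_{ij}}{w_i h_j^T} \right) - r_{ij} + w_i h_j^T \right) + \alpha \sum_{j \mid r_{ij} \notin \mathcal{R}} w_i h_j^T$$

In the case of the GKL, the process to remove the sum on unknown ratings is the same as with the absolute loss, except that in the absence of an absolute value we do not have to impose non-negativity of the features:

$$\sum_{j \mid r_{ij} \notin \mathcal{R}} w_i h_j^T = w_i (s^h)^T - \sum_{j \mid r_{ij} \in \mathcal{R}} w_i h_j^T \qquad (17)$$

This leads to:

$$l_i^w = \sum_{j \mid r_{ij} \in \mathcal{R}} \left( r_{ij} \log\left( \frac{r_{ij}}{w_i h_j^T} \right) - r_{ij} + (1 - \alpha)\, w_i h_j^T \right) + \alpha\, w_i (s^h)^T \qquad (18)$$

Now we can easily derive $\partial L / \partial w_i$:

$$\frac{\partial L}{\partial w_i} = \sum_{j \mid r_{ij} \in \mathcal{R}} \left( 1 - \alpha - \frac{r_{ij}}{w_i h_j^T} \right) h_j + \alpha\, s^h \qquad (19)$$

The corresponding expressions of $l_j^h$ and $\partial L / \partial h_j$ are obtained symmetrically. As with the absolute loss, the complexity of computing $l_i^w$ or $\partial L / \partial w_i$ is now $O(|\mathcal{R}_{i\bullet}| k)$ (it is $O(|\mathcal{R}_{\bullet j}| k)$ for $l_j^h$ and $\partial L / \partial h_j$).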
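The gradient obtained by differentiating Equation (18) can be checked against central finite differences of the naive user block, which sums over all items. The sketch below does this on random positive features; the function names, toy sizes, and step size are assumptions made for illustration only.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gkl_block_naive(w, H, rated, alpha):
    """User block of the GKL objective with prior 0, over all m items."""
    total = 0.0
    for j, h in enumerate(H):
        p = dot(w, h)
        if j in rated:
            r = rated[j]
            total += r * math.log(r / p) - r + p
        else:
            total += alpha * p  # D(0 || p) = p, Equation (16)
    return total

def gkl_grad_fast(w, H, rated, alpha, s_h):
    """Derivative of Equation (18): rated items only, plus alpha * s^h."""
    g = [alpha * s for s in s_h]
    for j, r in rated.items():
        h = H[j]
        c = (1.0 - alpha) - r / dot(w, h)
        for a in range(len(w)):
            g[a] += c * h[a]
    return g

random.seed(1)
k, m, alpha, eps = 3, 40, 0.05, 1e-6
# Features are kept strictly positive so that the logarithm is defined.
H = [[0.1 + random.random() for _ in range(k)] for _ in range(m)]
w = [0.1 + random.random() for _ in range(k)]
rated = {0: 5.0, 7: 2.0, 21: 4.0}  # item index -> rating of one user
s_h = [sum(h[a] for h in H) for a in range(k)]

fast = gkl_grad_fast(w, H, rated, alpha, s_h)
numeric = []
for a in range(k):
    up, down = list(w), list(w)
    up[a] += eps
    down[a] -= eps
    diff = gkl_block_naive(up, H, rated, alpha) - gkl_block_naive(down, H, rated, alpha)
    numeric.append(diff / (2 * eps))
max_gap = max(abs(f - n) for f, n in zip(fast, numeric))
print(max_gap)
```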


4.4 Static and Dynamic Factorization 

We introduce an online algorithm to learn the latent factors from the input data in a static setting, and show how it can accommodate updates in a dynamic setting.


Algorithm 1 Randomized block coordinate descent

Require:
  The ratings $\mathcal{R}$.
  The number of features $k$.
 1: Initialize $W$ and $H$.
 2: Compute $S^w$ and $S^h$ ($s^w$ and $s^h$)
 3: while not converged do
 4:   for all users and items, traversed in a random order do
 5:     In the case of a user ($i$) do
 6:       Perform a gradient step on $w_i$ using line search
 7:       Update $S^w$ ($s^w$)
 8:     In the case of an item ($j$) do
 9:       Perform a gradient step on $h_j$ using line search
10:       Update $S^h$ ($s^h$)
11:   end for
12: end while


4.4.1 Static Factorization 

In order to factorize a whole new set of data, we propose to use a randomized block coordinate descent [26]. At each iteration, all the users and items are traversed in a random order. For each of them, a gradient step is performed on their features while keeping the other features constant.

We can use a line search [2] to determine the size of the gradient step because the variation of $L$ for a modification of $w_i$ is entirely determined by $l_i^w$ and can therefore be computed efficiently. Line search avoids the burden of tuning the step size that is inherent to stochastic gradient descent (SGD) methods [7]. Moreover, using line search guarantees the convergence of the value of the objective function. Indeed, each gradient step decreases (or rather cannot increase) the objective function, which is bounded from below. This implies that the variation of the objective function converges to zero.

The complete procedure for the factorization through randomized block coordinate descent is summarized in Algorithm 1.
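The steps of Algorithm 1 can be sketched for the squared loss as follows. This is a minimal illustration under several assumptions that the text does not fix: regularization is omitted, the prior is 0, convergence is replaced by a fixed number of sweeps, and the line search is a simple backtracking rule that halves the step until the block loss of Equation (7) does not increase. All function names are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram(rows, k):
    """Sum of outer products v^T v over all rows (S^w or S^h)."""
    return [[sum(v[a] * v[b] for v in rows) for b in range(k)] for a in range(k)]

def block_loss(v, others, ratings, alpha, S):
    """Equation (7): the part of L that depends on one feature vector v."""
    k = len(v)
    total = alpha * sum(v[a] * S[a][b] * v[b] for a in range(k) for b in range(k))
    for idx, r in ratings.items():
        p = dot(v, others[idx])
        total += (r - p) ** 2 - alpha * p ** 2
    return total

def block_grad(v, others, ratings, alpha, S):
    """Equation (8): gradient of block_loss with respect to v."""
    k = len(v)
    g = [2.0 * alpha * sum(v[b] * S[b][a] for b in range(k)) for a in range(k)]
    for idx, r in ratings.items():
        o = others[idx]
        c = -2.0 * (r - (1.0 - alpha) * dot(v, o))
        for a in range(k):
            g[a] += c * o[a]
    return g

def gradient_step(v, others, ratings, alpha, S):
    """Backtracking: halve the step until the block loss does not increase."""
    g = block_grad(v, others, ratings, alpha, S)
    base = block_loss(v, others, ratings, alpha, S)
    step = 1.0
    while step > 1e-12:
        cand = [x - step * gx for x, gx in zip(v, g)]
        if block_loss(cand, others, ratings, alpha, S) <= base:
            return cand
        step *= 0.5
    return v  # no acceptable step found: keep the features unchanged

def replace_row(S, old, new):
    """Update S after one row changed, in O(k^2) instead of recomputing it."""
    for a in range(len(S)):
        for b in range(len(S)):
            S[a][b] += new[a] * new[b] - old[a] * old[b]

def full_loss(W, H, by_user, alpha):
    """Naive O(nmk) objective, used here only to monitor progress."""
    total = 0.0
    for i, w in enumerate(W):
        for j, h in enumerate(H):
            p, r = dot(w, h), by_user.get(i, {}).get(j)
            total += alpha * p ** 2 if r is None else (r - p) ** 2
    return total

def factorize(by_user, n, m, k=2, alpha=0.05, sweeps=20, seed=0):
    rng = random.Random(seed)
    by_item = {}
    for i, row in by_user.items():
        for j, r in row.items():
            by_item.setdefault(j, {})[i] = r
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(k)] for _ in range(m)]
    S_w, S_h = gram(W, k), gram(H, k)
    history = [full_loss(W, H, by_user, alpha)]
    blocks = [("user", i) for i in range(n)] + [("item", j) for j in range(m)]
    for _ in range(sweeps):
        rng.shuffle(blocks)
        for kind, idx in blocks:
            if kind == "user":
                new = gradient_step(W[idx], H, by_user.get(idx, {}), alpha, S_h)
                replace_row(S_w, W[idx], new)
                W[idx] = new
            else:
                new = gradient_step(H[idx], W, by_item.get(idx, {}), alpha, S_w)
                replace_row(S_h, H[idx], new)
                H[idx] = new
        history.append(full_loss(W, H, by_user, alpha))
    return W, H, history

toy = {0: {0: 5.0, 1: 4.0}, 1: {0: 4.0, 2: 1.0}, 2: {3: 5.0}}  # user -> item -> rating
W, H, history = factorize(toy, n=3, m=4)
print(history[0], history[-1])
```

Because every accepted step cannot increase the block loss, the monitored objective is non-increasing from one sweep to the next, which mirrors the convergence argument given above.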

Complexity. In the case of the squared loss, the computation of a fast gradient step relies on knowing $S^w$ and $S^h$. Their initial values are computed in $O(nk^2)$ and $O(mk^2)$, respectively, and the cost of updating them after each gradient step is $O(k^2)$. The total complexity of an iteration of our algorithm is therefore $O(|\mathcal{R}|k + (n + m)k^2)$, as good as the best factorization methods that do not use priors on unknown ratings [9].

In the case of the absolute loss and the generalized Kullback-Leibler divergence, the computation uses $s^w$ and $s^h$. Their initial values are computed in $O(nk)$ and $O(mk)$, while the cost of updating them is $O(k)$. The total complexity of one iteration then becomes $O(|\mathcal{R}|k + (n + m)k)$, which is lower than the complexity of the squared loss. However, this usually comes at a cost in the quality of the results, as we will show in the experiments in Section 5.

4.4.2 Fast Updates 

The expressions of $l_i^w$, $l_j^h$, and their gradients derived above allow us to compute the latent representations of one user or one item in a time independent of the number of users and items in the system. We can use that ability to design a simple algorithm for updating an existing factorization when a new rating is added to $\mathcal{R}$: if user $i$ rates item $j$, we iteratively perform gradient steps for $w_i$ and $h_j$, keeping all other features constant. This relies on the assumption that a new rating will significantly affect only the user and item that are directly concerned with it. Although this assumption can be disputed, we will show in our experiments (Section 5.3) that our update algorithm produces recommendations of stable quality, indicating that limiting our updates to the directly affected users and items does not degrade the factorization over time.

When ratings are produced by new users or given to new items, a new set of features for that user or item is created before performing the local optimization. Various initialization strategies could be explored here. However, as we show in our experimental results, assigning a random value to one of the features and setting the others to zero performs well in practice. The update procedure is summarized in Algorithm 2.

Algorithm 2 Update algorithm

Require:
  The new rating $r_{ij}$.
  The ratings of user $i$ ($\mathcal{R}_{i\bullet}$) and of item $j$ ($\mathcal{R}_{\bullet j}$).
1: If $w_i$ ($h_j$) does not exist, initialize it (for example by setting a random feature to 1).
2: Add $r_{ij}$ to $\mathcal{R}_{i\bullet}$ and $\mathcal{R}_{\bullet j}$.
3: while not converged do
4:   Perform a gradient step on $w_i$ using line search
5:   Update $S^w$ ($s^w$)
6:   Perform a gradient step on $h_j$ using line search
7:   Update $S^h$ ($s^h$)
8: end while

Complexity. As mentioned earlier, the cost of our update algorithm is independent of the number of users or items in the system, making it suitable for very large datasets. Each iteration of the update algorithm is composed of two gradient steps (one on the user’s features, and one on the item’s features). In particular, the complexity of one iteration is $O((|\mathcal{R}_{i\bullet}| + |\mathcal{R}_{\bullet j}|)k + k^2)$ for the squared loss, and only $O((|\mathcal{R}_{i\bullet}| + |\mathcal{R}_{\bullet j}|)k)$ for the absolute loss and the GKL. This difference in complexity becomes significant when $k$ is large with regard to the average number of ratings per user and per item.

Updates based on classic SGD methods have an even smaller complexity ($O(k)$), but we will show in Section 5 that our method produces recommendations of much higher quality, while still being able to satisfy applications requiring low-latency updates.

5. EXPERIMENTS 

We perform several experiments to demonstrate the following key points:

• Using priors on the unknown values leads to overall improved quality of ranking, in a static or dynamic setting.

• The quality does not degrade with time, i.e., as more 
updates are added, the model does not lose accuracy. 

• Our methods can outperform traditional techniques on 
various large datasets. 

In our experiments, we test the performance of the squared loss (SL) and the absolute loss (AL) with and without prior on unknown values. In Section 5.1 we describe our experimental setup: the benchmark datasets used, the performance metrics recorded, and how we tune the various parameters of the models tested during the experiments. Then, in Sections 5.2 and 5.3 we describe the results of our methods in a static and dynamic learning setting, respectively, and how they compare with state-of-the-art methods. In Section 5.4 we illustrate the importance of fast updates by studying the impact of having a delay between the arrival of new ratings and the update of the factorization. In Section 5.5 we investigate in depth the influence of parameter values selected in the two loss functions (squared and absolute loss). Finally, details allowing the reproducibility of the results are given in Section 5.6.

Table 1: Characteristics of the datasets used.

Dataset        Users      Items      Ratings
Movielens      6,040      3,706      1,000,209
FineFoods      256,059    74,258     568,454
AmazonMovies   889,176    253,059    7,831,442

5.1 Experimental Setup 

Here we briefly describe the experimental setup used for 
the static and dynamic learning and how the parameters of 
the different methods are tuned. 

5.1.1 Datasets 

During the experiments, we use three datasets with distinct features. Table 1 summarizes the characteristics of these datasets, which provide different challenges to the recommendation task:

• Movielens: This is the well-known movie ratings dataset produced by the Grouplens project. We use the version containing 1 million ratings, with at least 20 ratings for each user.

• FineFoods: This is a collection of ratings about food products extracted from the Amazon comments [19]. The dataset is much sparser than Movielens, with most users having only a handful of ratings, making it a very hard dataset for the recommendation task.

• AmazonMovies: This is a larger collection of ratings extracted from the movie section of Amazon [19]. This dataset is also sparser than Movielens, although not as sparse as FineFoods.

5.1.2 Evaluation Metrics 

We measure two standard metrics used in ranking evaluation: (1) Normalized Discounted Cumulative Gain (NDCG) and (2) area under the ROC curve (AUC).

NDCG rewards methods that rank items with the highest 
observed rating at the top of the ranking. The discounted 
aspect of NDCG comes from the fact that relevant items 
ranked at low positions of the ranking contribute less to the 
final score than relevant items at top positions. 
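The discounting described above can be illustrated with a short sketch. The exact gain and discount functions are not specified in the text, so the choices here are assumptions: the gain of an item is its held-out rating (0 for items without one), and the discount is the base-2 logarithm of the position plus one. The function names are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))

def ndcg(scores, held_out):
    """scores: item -> predicted score; held_out: item -> observed rating.

    Items absent from held_out are treated as having gain 0.
    """
    ranking = sorted(scores, key=scores.get, reverse=True)
    gains = [held_out.get(item, 0.0) for item in ranking]
    best = dcg(sorted(gains, reverse=True))
    return dcg(gains) / best if best > 0 else 0.0

held_out = {"a": 5.0, "c": 3.0}
good = ndcg({"a": 0.9, "c": 0.8, "b": 0.1}, held_out)   # rated items ranked first
worse = ndcg({"a": 0.9, "b": 0.8, "c": 0.1}, held_out)  # item c pushed to the bottom
print(good, worse)
```

Moving a relevant item from the second to the third position lowers the score only moderately, because its contribution is discounted rather than discarded.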

In the static experiments, we also report the NDCG computed on the rated items only. This metric does not reflect the real-world scenario, which consists of ranking all items, since we do not know in advance which items will be rated. Intuitively, by biasing our objective through the introduction of priors on unknown ratings, we may lose performance when ranking rated items only, while performing better when considering all the items.

Table 2: List of the parameters of each method, and set of values tested during the parameter tuning of the squared loss (SL) and absolute loss (AL), with and without prior on unknown values, as well as the multiplicative update algorithm (Mult-NMF), Alternating Least Square (ALS-UV), and Vowpal Wabbit (VW). k: number of features, λ: regularization coefficient, ρ: unknown/known influence ratio, γ: learning rate.

Method                 Parameter   Tested values
SL/AL with prior       k           5, 10, 20, 50, 100, 200
                       λ           0, 0.01, 0.1, 1, 10
                       ρ           0.3, 0.7, 1, 2
SL/AL without prior    k           5, 10, 20, 50, 100, 200
                       λ           0, 0.01, 0.1, 1, 10
ALS-UV                 k           20, 50, 100, 200, 500
                       λ           0, 0.001, 0.01, 0.05, 0.1
Mult-NMF               k           20, 50, 100, 200, 500
VW                     k           20, 50, 100, 200, 500
                       λ           0, 1e-5, 1e-2
                       γ           0.01, 0.02, 0.05, 0.1, 0.2

We use AUC to evaluate the ability of the different methods to predict which items are going to be rated. AUC measures whether the items whose ratings were held out during learning are ranked higher than unrated items. A perfect ranking has an AUC of 1, while the average AUC of a random ranking is 0.5.
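A minimal sketch of this AUC, written as the fraction of (held-out, unrated) item pairs that the model orders correctly, is given below. Counting ties as one half and excluding training ratings from the comparison are assumptions, and the function name is hypothetical.

```python
def auc(scores, held_out, already_rated=()):
    """Probability that a held-out item is scored above an unrated item.

    Ties count for one half. Items in already_rated (training ratings)
    are left out of the comparison.
    """
    skip = set(already_rated)
    positives = [scores[i] for i in held_out]
    negatives = [s for i, s in scores.items()
                 if i not in held_out and i not in skip]
    if not positives or not negatives:
        raise ValueError("need at least one held-out and one unrated item")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

scores = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.1, "e": 0.8}
perfect = auc(scores, held_out={"a", "c"}, already_rated={"e"})
mixed = auc(scores, held_out={"a", "d"}, already_rated={"e"})
print(perfect, mixed)
```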

5.1.3 Parameter Tuning 

Table 2 shows the parameters of the various models and the values tested during parameter tuning. For each test, the parameter values producing the best ranking on the validation sets (measured by NDCG for the static test and AUC for the dynamic test) were selected. See Sections 5.2 and 5.3 for the description of the validation sets.

5.2 Static Learning 

Research question. In a static mode, we test to what extent using a prior on unknown ratings improves the ranking of items when recommended to users.

Process followed. The test set was constructed by randomly selecting 1000 users and splitting the ratings of those users in half according to timestamp, keeping the first 50% of the ratings in the training set and the last 50% in the test set. The same process (selecting 1000 users and splitting their ratings) was then applied three times on the training set in order to create three training/validation pairs of sets. On each run, the parameters producing on average the best NDCG over the three validation sets were then used to factorize the full training set, and evaluated on the test set.
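The split described above can be sketched as follows. The record layout (user, item, rating, timestamp), the function name, and the handling of users with an odd number of ratings are assumptions made for illustration.

```python
import random

def split_by_time(ratings, n_test_users, seed=0):
    """ratings: list of (user, item, rating, timestamp) tuples."""
    rng = random.Random(seed)
    users = sorted({rec[0] for rec in ratings})
    chosen = set(rng.sample(users, n_test_users))
    train, test, per_user = [], [], {}
    for rec in ratings:
        if rec[0] in chosen:
            per_user.setdefault(rec[0], []).append(rec)
        else:
            train.append(rec)  # users outside the sample stay fully in training
    for recs in per_user.values():
        recs.sort(key=lambda rec: rec[3])
        half = len(recs) // 2  # assumption: odd counts give the extra rating to the test set
        train.extend(recs[:half])
        test.extend(recs[half:])
    return train, test

# Toy data: 5 users with 4 ratings each, timestamps increasing per user.
ratings = [(u, i, 3.0, 10 * u + i) for u in range(5) for i in range(4)]
train, test = split_by_time(ratings, n_test_users=2)
print(len(train), len(test))
```

Applying the same function again to the returned training set, with different seeds, would give the training/validation pairs used for parameter tuning.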

Baseline. We report the results achieved by two traditional, well-known algorithms: UV matrix decomposition solved with Alternating Least Square (ALS-UV) [30], and non-negative matrix factorization with the multiplicative update algorithm (Mult-NMF). Both ALS-UV and Mult-NMF use the squared loss.

Table 3: Comparison of our introduced algorithm in static learning on the datasets Movielens, FineFoods and AmazonMovies. Values in bold hold for the method that outperforms all the other methods according to a Mann-Whitney U test with a significance level of 1%. Average values are shown alongside their standard deviation over 10 runs.

Movielens
Method         NDCG-RI            NDCG               AUC
SL w/ prior    0.885 ± 0.0014     0.5046 ± 0.0013    0.8695 ± 0.0012
SL w/o prior   0.886 ± 0.0015     0.3597 ± 0.0012    0.6548 ± 0.0014
AL w/ prior    0.8683 ± 0.0030    0.4452 ± 0.0009    0.8134 ± 0.0011
AL w/o prior   0.8794 ± 0.0031    0.3801 ± 0.0106    0.6927 ± 0.0322
Mult-NMF       0.8433 ± 0.0007    0.3758 ± 0.0006    0.7011 ± 0.0009
ALS-UV         0.8332 ± 0.0014    0.3292 ± 0.0004    0.5839 ± 0.0005

FineFoods
Method         NDCG-RI            NDCG               AUC
SL w/ prior    0.887 ± 0.0016     0.1237 ± 0.0039    0.8452 ± 0.0074
SL w/o prior   0.888 ± 0.0158     0.1023 ± 0.0022    0.8314 ± 0.0058
AL w/ prior    0.8722 ± 0.0142    0.1026 ± 0.0030    0.8412 ± 0.0047
AL w/o prior   0.8730 ± 0.0260    0.0923 ± 0.0008    0.7294 ± 0.0143
Mult-NMF       0.8476 ± 0.0084    0.0830 ± 0.0008    0.3403 ± 0.0052
ALS-UV         0.8653 ± 0.025     0.0873 ± 0.0009    0.5485 ± 0.0114

AmazonMovies
Method         NDCG-RI            NDCG               AUC
SL w/ prior    0.8992 ± 0.0101    0.1887 ± 0.0088    0.9276 ± 0.0031
SL w/o prior   0.9035 ± 0.0089    0.1103 ± 0.0008    0.8656 ± 0.0033
AL w/ prior    0.8804 ± 0.0077    0.1348 ± 0.0035    0.8634 ± 0.0045
AL w/o prior   0.8854 ± 0.0102    0.1002 ± 0.0012    0.7625 ± 0.0051
Mult-NMF       0.8498 ± 0.0026    0.0959 ± 0.0004    0.6330 ± 0.0040
ALS-UV         0.8658 ± 0.0034    0.0906 ± 0.0003    0.6601 ± 0.0061

Results. The results, averaged over 10 runs, are shown in Table 3. We can observe that for both the squared loss and the absolute loss, and on all datasets, adding a prior on the unknown ratings significantly improves the rankings of the items recommended to users over the rankings obtained by the same techniques without a prior on the unknown ratings (and also over the rankings obtained by the state-of-the-art approaches ALS-UV and Mult-NMF). In particular, our implementation of the squared loss with prior outperforms all other methods, as confirmed by a Mann-Whitney U test with a significance level of 1%.

On the Movielens dataset, the results of Mult-NMF and ALS-UV are, as expected, similar to the ones of our implementation of the squared loss without prior. Indeed, those methods optimize the same objective function, and differ only by their algorithm. Interestingly, on the sparser FineFoods and AmazonMovies datasets, our randomized block coordinate descent method outperforms Mult-NMF and ALS-UV, even without prior on the unknown ratings.

Furthermore, in the three tested datasets, when only considering the rated items, the loss in ranking performance is never significant (see NDCG-RI with and without prior in Table 3). In other words, while improving on the global ranking, the performance does not deteriorate when considering only the subset of rated items.

5.3 Dynamic Learning 

Research question. In this section, we target two research 
questions: 

1. We test whether our update algorithm is able to sustain a stable quality of recommendations over time;

2. We test to what extent using priors on unknown ratings improves the ranking of recommended items when the model is updated each time a new instance is encountered. By so doing, the system is evaluated on more realistic scenarios where the cases of cold items and users are considered as well.

Process followed. We order the ratings by timestamps 
and separate the ratings in three blocks: first the training, 
then the validation and finally the test block (see Table [4] 
for the size of each block in the different datasets). 

Table 4: Number of ratings in each block for the dynamic 
learning. 


Dataset         Training    Validation  Test
Movielens       500,000     100,000     100,000
FineFoods       400,000     100,000     68,000
Amazon Movies   5,000,000   1,000,000   1,000,000
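
The chronological split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the tuple layout `(timestamp, user, item, rating)` and the function name `chronological_split` are assumptions, and the block sizes are passed in explicitly (for example the values of Table 4).

```python
def chronological_split(ratings, n_train, n_valid):
    """Order ratings by timestamp and cut them into three consecutive blocks.

    `ratings` is a list of (timestamp, user, item, rating) tuples
    (hypothetical layout). Everything after the training and validation
    blocks goes to the test block.
    """
    ordered = sorted(ratings, key=lambda r: r[0])
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]
    return train, valid, test
```

Because the blocks are consecutive in time, users and items that first appear in the validation or test block are unseen by the initial model, which is what makes the cold-start cases part of the evaluation.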

The evaluation is performed as follows: an initial model is 
built based on all the ratings present before the test block, 
then, for each rating of the test block, two steps are per¬ 
formed in the following order: 

1. The current model is evaluated by computing the AUC 
over the new (user, item) pair. Notice that in this case, 
computing the AUC means computing the proportion 
of items not yet rated by the user that the model ranks 
lower than the item that was just rated. An AUC of 1 
means that the new item was the top recommendation 
of the method for that user. 

2. The model is updated using the new rating. It is worth 
noticing that the rating may concern a new user or 
item, and, therefore, features for that new user/item 
have to be added to the model. 

Parameter tuning is done as described above, but starting 
at the beginning of the validation block and ending before 
the test block. The values of parameters tested are the same 
as in the static test (see Table 2).

Baseline. We compared our methods to Vowpal Wabbit 
(VW). VW is a machine learning framework solving differ¬ 
ent optimization problems for classification and ranking, by 
implementing a carefully optimized, stochastic gradient de¬ 
scent (SGD) using feature hashing 315] and adaptive gra¬ 
dient steps [7]. We are using the VW’s implementation of 
low-rank interactions (see footnote 1) based on factorization machines [22].

Results. Figure 1 shows how the average AUC evolves as
new ratings enter the system. We first observe that the 
quality of the results does not decrease over time, indicating 
that our update algorithm can work for long periods of time 
without propagating or amplifying errors. As in the static 
experiment, we confirm that adding a prior on unknown 
ratings improves the quality of the ranking and, again, this 
is maintained across time. Moreover, the SGD approach of 
VW is outranked in each dataset by our approach with prior. 

5.4 Impact of delayed updates 

Research Question. We test the performance of delayed 
models produced by our methods in delivering recommen¬ 
dations to users. 

1 https://github.com/JohnLangford/vowpal_wabbit/tree/master/demo/movielens


Figure 1: Performance comparison with respect to average 
AUC for the various methods tested in dynamic learning on 
Movielens, FineFoods and AmazonMovies. Results are aver¬ 
aged over 20 runs. 


Process followed. In order to address this question, we 
simulate a recommender system that is not able to incor¬ 
porate new ratings in the model as soon as they enter the 
system. To do so, we modify the process of dynamic learning
presented in Section 5.3 to impose a delay between the
arrival of a new rating and the update of the factorization.
More precisely, after the i-th rating is given by a user, the
model is updated up to the (i - d)-th rating (d being the
arbitrary delay). This way, the model is always d ratings
behind the most recent one (the ratings are sorted by real
time of arrival). In real applications, the delay would prob¬ 
ably vary, depending on the level of activity of the users. 
However, this experiment gives a first impression of the im¬ 
pact of delays on the recommendation task. 
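
The delayed protocol can be sketched as a small variation of the test-then-update loop. This is an illustration under stated assumptions, not the authors' code: `evaluate` and `update` are hypothetical callbacks standing for the AUC computation and the factorization update, and the stream is assumed to be a list already sorted by arrival time.

```python
def delayed_test_then_update(evaluate, update, stream, delay):
    """Evaluate each incoming rating with a model that lags `delay` ratings behind.

    After the i-th rating arrives, the model is updated only with the
    (i - delay)-th rating, so it never sees the `delay` most recent ones.
    With delay = 0 this reduces to the plain test-then-update loop.
    """
    results = []
    for i, event in enumerate(stream):
        results.append(evaluate(event))  # model is `delay` ratings behind
        lagged = i - delay
        if lagged >= 0:
            update(stream[lagged])       # catch up by exactly one rating
    return results
```

A variable delay, as would occur in a real system, could be simulated by replacing the constant `delay` with a value that depends on the current load; that extension is not covered here.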

Results. Figure 2 shows the impact of a delay on the average
AUC of the squared loss and absolute loss with prior
for a dense dataset like Movielens and a sparse dataset like
FineFoods. We observe that even a small delay can affect
the quality of the recommendation, depending on the characteristics
of the data. For Movielens, if the model is behind
by 5 to 10 ratings, the average AUC drops by 3%, and it goes
down by about 14% when the model is behind by 1000 ratings;
this applies to both loss functions. On the other
hand, for the much sparser FineFoods, the effect is more apparent:
with only 5 to 10 ratings behind, the model's AUC
already drops by 10%.

To show the effect of fast updates on weakly-engaged (or
cold) users, we also report the impact of delays on those
users for both Movielens and FineFoods with the squared
loss, which performs best (Figure 2, Cold Users). We define
such users as the ones that rated at most two items. As
hypothesized, the cost induced by delayed predictions (for
a delay of five ratings) is higher for cold users. We observe a
relative drop in AUC of 11.8% and 13.4% for weakly-engaged
users on Movielens and FineFoods, respectively, while when
considering all the users, the relative drop is 1.1% and 12.9%,
respectively.

Figure 2: Average AUC of the squared loss and absolute
loss with prior on Movielens and FineFoods for various delays
d, imposed as a number of ratings that the model is behind
the current rating.

In such sparse scenarios, cold users perform only a handful 
of actions before deciding to abandon the site or not. There¬ 
fore, it is important to consider cold users in the model as 
soon as they arrive, to keep them engaged by fast, efficient 
and good recommendations. 

5.5 Parameter Analysis 

Research Question. We test to which extent the number
of features (k), the weight of the prior (p) and the regularization
coefficient (λ) affect the AUC and the runtime per
update on our loss functions.

Figure 3 shows the results of this investigation for different
values of these parameters, for both squared loss and
absolute loss and on each dataset. The results are obtained 
using the dynamic learning process. 

Number of features. Concerning the quality of ranking
(AUC), we observe the usual overfitting/underfitting trade-off
(Figure 3(a)). The optimal number of features depends
on the dataset as well as on the loss function used, suggesting
that careful tuning of that parameter is always needed.

In some cases, speed constraints will force the use of a
suboptimal number of features. Indeed, the update runtime
heavily depends on the number of features. Figure 3(d) suggests
a linear relationship between runtime and number of
features. For both losses there is indeed a linear role of the 
number of features in the theoretical complexity (Section 
). Notice, however, that the theoretical complexity of the 
squared loss also has a quadratic term that becomes dominant
for a large number of features (relative to the number
of ratings per user). Also note that while the squared
loss produces better AUC, the absolute loss is able to sus¬ 
tain higher update rates, and can therefore be the loss of 
choice when speed is the first criterion. 

Regularization coefficient. The influence of λ seems
rather limited, except for high values that cause both the
AUC and the update runtime to drop (Figure 3(b) and (e)).
A small regularization is supposed to increase the quality of
the model by reducing overfitting, but this effect is not visible
here. The reason may be that the role of regularization is
already taken by the prior on unknown ratings. Introducing
the prior seems to have the side effect of making the regularization
obsolete (or redundant). In fact, we confirm this
with the results for λ = 0, which show no impact on
the quality or runtime. Again, we see that setting a prior on
unknown ratings increases the quality of recommendations
without increasing the complexity of the solution. While it
adds a term and a parameter to the objective function, it
allows us to remove another term and its associated parameter.

Unknown/known influence ratio. The ratio p influences 
the performance of the squared loss algorithm in the follow¬ 
ing way: the AUC increases when a prior on unknown values 
is added (p > 0), but the exact value of p has little influence 
(in the observed range) (Figure 3(c)). The absolute loss is
more sensitive to the value of p, with the AUC decreasing 
when p becomes too large (on Movielens and Amazon Movies). 
However, in both cases, and on all datasets, giving the same 
weight to the known and unknown ratings (p = 1) offers a 
significant improvement over not using a prior, suggesting 
that p = 1 can be used as a first guideline, avoiding the 
burden of further parameter tuning. 

The update runtime is also affected by p, decreasing when 
p increases (Figure 3(f)). A possible explanation is that the
prior on unknown ratings acts as a regularizer, driving fea¬ 
tures towards 0, and in doing so speeding up the conver¬ 
gence. 

Runtime. In general, our technique demonstrates low run¬ 
ning time which is heavily dependent on the number of fea¬ 
tures used, and less on the regularization applied or the ra¬ 
tio of unknown over known values. These results demon¬ 
strate that our method can satisfy applications requiring 
low-latency updates. 

5.6 Reproducibility of Results 

The implementation of the algorithms introduced in Section 4
is available on GitHub:

https://github.com/rdevooght/MF-with-prior-and-updates 

For both ALS-UV and Mult-NMF we use the implementa¬ 
tion of GraphChi, an open source tool for graph computation 
with impressive performance M- 

The code and documentation of Vowpal Wabbit are available
on its GitHub page:

https://github.com/JohnLangford/vowpal_wabbit/wiki 

The Amazon datasets are available on the SNAP webpage: 
http://snap.Stanford.edu/data/index.html 

The Movielens dataset is available on the Grouplens page: 
http://www.grouplens.org/datasets/movielens/ 

6. RELATED WORK 

The problem of recommending products based on the ac¬ 
tions and feedback from other users (rather than based on 
content similarity) is often called collaborative filtering, and 
dates back 20 years, with works such as Tapestry m
and Grouplens [2F;. The field is now dominated by meth¬ 
ods based on matrix factorization, with algorithms such 
as ALS m , the multiplicative update rule DU, and the 
stochastic gradient descent method (SGD) [9] [12].

The missing at random assumption has yet to get the
attention it deserves in collaborative filtering. Both [18]
and [28] have validated the hypothesis of ratings missing
not at random. Practical propositions for the interpretation
of missing data can be found in the fields of one-class collaborative
filtering and collaborative filtering based on implicit
feedback, where the missing at random assumption is often
obviously untenable [11] [20] [21]. m offers an interesting
approach where missing ratings are considered as optimization
variables; they use an EM algorithm to optimize in
turn the factorization and the estimation of missing values.
Unfortunately, that method has a high complexity, and the
proposed approximations that work with large problems remove
some of the method’s appeal.

Figure 3: Influence of parameter values on the AUC and runtime for the models produced by the squared loss (SL) and
absolute loss (AL). Y-error bars show one standard deviation around the average value of each metric.

None of those works, however, consider the real world, dy¬ 
namic scenario of continuously observing new ratings, users 
and items. Other works HE! focus on the dynamic update 
of matrix factorization (mainly through the use of SGD), 
but those, on the other hand, implicitly rely on the missing 
at random assumption, and therefore suffer from lower accuracy
in predictions. Other state-of-the-art methods for matrix
factorization scale by relying on stochastic gradient computation
EH, whereas we rely on an exact gradient approach. In
this work, unlike most scalable machine learning
techniques in use today [6], we base our
approach on randomized block coordinate descent and compute
exact gradients in order to deal with the missing data of
large-scale matrices.

7. CONCLUSIONS 

In this work we proposed a new, simple, and efficient way
to incorporate a prior on unknown ratings in several loss
functions commonly used for matrix factorization. We experimentally
demonstrated the importance of adding such a
prior to solve the problem of collaborative ranking. We also
tackled the problem of updating the factorization when new
users, items and ratings enter the system. We believe that
this problem is central to real applications of recommendation
systems, because new users constantly enter those systems
and the factorization must be kept up to date to give
them recommendations immediately after their first few interactions
with the platform. We offer an update algorithm
whose complexity is independent of the size of the data,
making it a good approach for large datasets. In the future,
we would like to explore how our methods perform under
real workloads of updates with variable arrival rates of ratings
per user and item. Furthermore, we would like to test
the performance of our methods on platforms built to analyze
streams of data, such as Storm, Twitter’s distributed
processing engine.